Unsupervised Learning from URL Corpora
Abstract
This paper illustrates the utility of URL information in unsupervised learning. We first outline the motivation for using URL information and then present two techniques for unsupervised learning from URL corpora. First, we devise a similarity measure for URL pairs, lay out the intuitions behind it, and verify its quality by using it for clustering. We also outline a method for keyword identification based on this similarity measure. Second, we explore the use of character N-grams for flat clustering of URL corpora. The pressure to keep URLs compact leads to many variants of the same word, which is a distinctive kind of forcibly inserted noise. N-gram based models are tolerant of such noise and are compact as well. We then compare the two similarity measures using a rank correlation measure, which reveals that their agreement depends heavily on the corpus used, in particular on the distinctiveness of the clusters in the corpus. Because URLs are small entities, our techniques are orders of magnitude faster than unsupervised techniques on full-text corpora and require far less information. To the best of our knowledge, this is the first attempt at unsupervised learning from URL information.
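The abstract does not give the exact form of either similarity measure, so the following is only a minimal sketch of the general idea behind character N-gram comparison of URLs. It assumes trigram counts compared by cosine similarity; the function names and the example URLs are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import sqrt


def char_ngrams(url: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in a lowercased URL."""
    text = url.lower()
    # For strings shorter than n the range is empty, giving an empty Counter.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def url_similarity(u: str, v: str, n: int = 3) -> float:
    """Similarity of two URLs under the assumed n-gram/cosine model."""
    return cosine_similarity(char_ngrams(u, n), char_ngrams(v, n))


# Hypothetical URLs: the first two use short and long forms of the same words.
short_form = "example.org/dept/compsci/courses"
long_form = "example.org/department/computer-science/course"
unrelated = "example.net/recipes/chocolate-cake"

if __name__ == "__main__":
    print(url_similarity(short_form, long_form))
    print(url_similarity(short_form, unrelated))
```

In this sketch the abbreviated and spelled-out URLs still share many trigrams (for example those in "dep", "com" and "cours"), so they score higher against each other than against the unrelated URL. This is the kind of tolerance to word variants that the abstract attributes to N-gram models.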
Related resources
UnURL: Unsupervised Learning from URLs
Web pages are identified by their URLs. For authoritative web pages, that is, pages focused on a specific topic, webmasters tend to use URLs that summarize the page. URL information is well suited to clustering because URLs are small and ubiquitous, making techniques based on URL information alone orders of magnitude faster than those that also use the text content. We present a system that makes...
Unsupervised Neural Machine Translation
In spite of the recent success of neural machine translation (NMT) in standard benchmarks, the lack of large parallel corpora poses a major practical problem for many language pairs. There have been several proposals to alleviate this issue with, for instance, triangulation and semi-supervised learning techniques, but they still require a strong cross-lingual signal. In this work, we completely...
Disentangling from Babylonian Confusion - Unsupervised Language Identification
This work presents an unsupervised solution to language identification. The method sorts multilingual text corpora on the basis of sentences into the different languages that are contained and makes no assumptions on the number or size of the monolingual fractions. Evaluation on 7-lingual corpora and bilingual corpora shows that the quality of classification is comparable to supervised approache...
A Statistical Model for Unsupervised and Semi-supervised Transliteration Mining
We propose a novel model to automatically extract transliteration pairs from parallel corpora. Our model is efficient, language pair independent and mines transliteration pairs in a consistent fashion in both unsupervised and semi-supervised settings. We model transliteration mining as an interpolation of transliteration and non-transliteration sub-models. We evaluate on NEWS 2010 shared task d...
Unsupervised Lexical Learning with Categorial Grammars using the LLL Corpus
In this paper we report on an unsupervised approach to learning Categorial Grammar (CG) lexicons. The learner is provided with a set of possible lexical CG categories, the forward and backward application rules of CG and unmarked positive only corpora. Using the categories and rules, the sentences from the corpus are probabilistically parsed. The parses and the history of previously parsed se...
Publication date: 2006